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CLONE-ARRAY POOLED SHOTGUN MAPPING AND SEQUENCING: 
DESIGN AND ANALYSIS OF EXPERIMENTS 

MIKLOS CSUROS, BINGSHAN LI, AND ALEKSANDAR MILOSAVLJEVIC 

Abstract. This paper studies sequencing and mapping methods that rely solely on pooling and 
shotgun sequencing of clones. First, we scrutinize and improve the recently proposed Clone- Array 
Pooled Shotgun Sequencing (CAPSS) method, which delivers a BAC-linked assembly of a whole 
genome sequence. Secondly, we introduce a novel physical mapping method, called Clone-Array 
Pooled Shotgun Mapping (CAPS-MAP), which computes the physical ordering of BACs in a random 
library. Both CAPSS and CAPS-MAP construct subclone libraries from pooled genomic BAC 
clones. 

We propose algorithmic and experimental improvements that make CAPSS a viable option for 
sequencing a set of BACs. We provide the first probabilistic model of CAPSS sequencing progress. 
The model leads to theoretical results supporting previous, less formal arguments on the practi- 
cality of CAPSS. We demonstrate the usefulness of CAPS-MAP for clone overlap detection with 
a probabilistic analysis, and a simulated assembly of the Drosophila melanogaster genome. Our 
analysis indicates that CAPS-MAP is well-suited for detecting BAC overlaps in a highly redundant 
library, relying on a low amount of shotgun sequence information. Consequently, it is a practical 
method for computing the physical ordering of clones in a random library, without requiring ad- 
ditional clone fingerprinting. Since CAPS-MAP requires only shotgun sequence reads, it can be 
seamlessly incorporated into a sequencing project with almost no experimental overhead. 



1. Introduction 

In a hierarchical approach to large genome sequencing, one first breaks many genome copies 
into random fragments. A library is constructed by cloning the fragments, typically as Bacterial 
Artificial Chromosome inserts (BACs). Some BACs in the library are selected for complete se- 
quencing. Each selected BAC sequence is assembled individually using the shotgun method: a 
subclone library is prepared by cloning short fragments of the BAC. Subsequently, sequence reads 
are produced from a sufficient number of randomly chosen subclones. The reads are assembled al- 
gorithmically into the BAC sequence. An alternative to the hierarchical, or clone-by- clone, strategy 
is the whole-genome shotgun approach (Weber and Myers 1997), which employs a few (essentially 



1-3) subclone libraries prepared from the entire genome, without resorting to an intermediate BAC 
library. The main advantage of the whole-genome approach is that it eliminates the need to prepare 
tens of thousands of subclone libraries to sequence a mammalian genome. However, it is generally 
an inadequate strategy for finishing the assembly of such large repeat-rich genomes. For a review 



of contemporary sequencing methodologies, see, e.g.. Green (2001) 

A new BAC-based sequencing strategy, called Clone- Array Pooled Shotgun Sequencing (CAPSS), 
was proposed recently (jCai et al. 2001|1 . CAPSS assembles the complete sequences of individual 
BACs as does the clone-by-clone approach, but requires a much smaller number of subclone library 
preparations. The strategy is currently being applied for the first time on a genome scale in the 
context of sequencing the honey bee genome. This paper provides the theory for the design and 
analysis of pooling-based genome projects. It also introduces the CAPS-MAP method for physical 
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Figure 1. CAPSS strategy for arrayed BACs. DNA extracted from each cfone is pooled 
together with other clones in the same row and column. Subclone libraries are prepared 
from the pools, and shotgun sequences are collected from the sublibraries. Sequences are 
assembled into contigs. If a contig contains sequences from a row and a column pool's 
sublibrary, the contig is assigned to the BAG at the intersection of the row and the column. 

mapping, and transversal pooling designs for both CAPSS and CAPS-MAP, thereby laying the 
theoretical foundation for pooling-based genome-scale sequencing projects. 

In a clone-by-clone approach, BACs are sequenced independently: one subclone library is con- 
structed for every clone. In contrast, DNA from BACs are pooled together in a CAPSS approach, 
and subclone libraries are prepared from the pools. A CAPSS experiment is designed so that the 
number of subclone libraries is much smaller than the number of clones, yet the pooling design 
enables the assembly of individual clone sequences. In what follows, by pooled shotgun (CAPS) 
sequences we mean shotgun sequence reads collected from a subclone library that was constructed 
using pooled BACs. For the computational aspects of sequence assembly, pooled shotgun sequences 
are random subsequences originating from a set of clone sequences. 

The original CAPSS proposal of |Cai et al. (2001) relied on a simple rectangular design defined 
by an array layout of BACs (FigurellJ). The pools correspond to the rows and columns. An array 
layout reduces the number of shotgun library preparations to the square root of the number of 
BACs when compared to clone-by-clone sequencing. This reduction can be important in case of a 
mammalian genome, for which even a minimally overlapping tiling path contains between twenty 
and thirty thousand clones ()IHGSC 2001 j) . 

This paper has two goals. First, after pointing out some shortcomings of the original CAPSS 
proposal, we propose algorithmic and experimental improvements that make CAPSS a viable option 
for sequencing a set of BACs. Specifically, we apply transversal pooling designs to increase the 
accuracy of CAPSS, which we previously developed for the PGI method of comparative physical 
mapping that also uses pooled shotgun sequencing ( |Csuros and Milosavljevic 2002 ). We provide 



the first probabilistic model of CAPSS sequencing progress. The model leads to theoretical results 
supporting previous, less formal arguments on the practicality of CAPSS. 

The paper's second goal is to introduce the Clone-Array Pooled Shotgun Mapping ( CAPS-MAP) 
method to detect clone overlaps in a random BAC library. The information on clone overlaps is 
used to compute the physical ordering of clones in the library, without requiring additional clone 
fingerprinting. CAPS-MAP operates in the same experimental framework as CAPSS. It needs only 
shotgun sequences, which makes it a cost-effective method that can be seamlessly integrated into 
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a sequencing project with very little experimental overhead. We demonstrate the usefulness of 
CAPS-MAP for clone overlap detection with a probabilistic analysis. In addition to the theoretical 
results, we illustrate the method's performance in a simulated project using the Drosophila genome 
assembly. 



2. Transversal designs 

It was proposed by jCai et al. (2001)| that CAPSS be used in hybrid projects, combining whole- 
genome shotgun (WGS) and pooled shotgun (CAPS) sequences. The motivation is that the pooled 
shotgun sequences can provide the localization information for the whole-genome shotgun sequences 
so that the latter can be used for a clone-linked assembly. After WGS and CAPS sequences from a 
set of pools are assembled into contigs, the contigs need to be mapped to individual BACs. There 
are a few challenges to contig mapping. We mention here three main problems: false negatives, 
ambiguities, and false mapping. A false negative refers to a situation where a BAC is not sampled 
in a pool it is included in, due to the low number of CAPS sequences collected. A false negative 
for a simple rectangular design means that no contigs can be mapped to the BAC. Ambiguities 
and false mappings are caused by overlapping clones, or more generally, by clones that have highly 
similar regions. The mapping of a contig is ambiguous if it is not possible to decide which clones 
the contig should be assigned to, in cases where two or more clone sets are equally likely choices for 
the mapping. False mapping occurs when an insufficient number of CAPS sequences are collected, 
and a contig that covers overlapping BACs gets assigned to the wrong clone or clone set. 



One strategy used to overcome the mapping problems involves transversal pooling designs ( Csiiros and Milosavlje 



Du and Hwang 2000). For a transversal design with n pool sets, every clone is included in exactly 



one pool of each pool set, and any subset with two of those pools uniquely identifies the clone. 
Half of the pool sets are designated as column pools, and the other half as row pools to realize the 
design with the help of BAC arrays. Using a transversal double-array design (i.e., one with four 
pool sets), the same set of BACs is independently arrayed twice. Each of the two resulting arrays 
contains the same set of BACs. Thus, each BAC ends up being sampled in two column-pools and 
two row-pools. One of the arrays contains an arbitrary arrangement of BACs, while the other is 
"reshuffled" relative to the first. More generally, clones can be arranged on d reshuffled arrays using 
a transversal pooling design with n = 2d pool sets. 

The number of arrays in a transversal design may be adjusted to allow unambiguous and correct 
contig mapping for any redundancy in a BAC library. Specifically, it can be shown ( Csiiros and Milosavljevic 2002} 



|Du and Hwang 20 00 ) that a d-array transversal design can accurately resolve BACs at up to 
{2d — 1)X redundancy. We previously described and analyzed transversal designs in the context of 
pooled shotgun experiments ( Csiiros and Milosavljevic 2002) and compared their performance to 



other designs. Even though our analysis was performed for the Pooled Genomic Indexing (PGI) 
method in the context of comparative physical mapping, the results are generally valid for CAPSS 
and CAPS-MAP as well. Specifically, our results indicate that transversal designs reduce the 
frequency of false negatives and false mappings when compared to a simple rectangular design. 
Furthermore, when compared to other more complicated designs, they achieve an optimal balance 
between the number of shotgun library preparations and the frequency of contig mapping problems. 
Transversal designs also enjoy a practical advantage over more complicated combinatorial designs, 
in that they are readily implemented using existing automated clone arraying technologies. 

When a transversal design is used, contig mapping can be implemented very efficiently, based on 
an algorithm that runs in 0{N + M) time for mapping M contigs onto N BACs. Without going 
into details, the main idea is to first build in 0{N) time a hash table that maps pool pairs to BACs. 
Based on the property of transversal designs that two pools identify a clone, this table contains all 
pool pairs that identify a unique clone. For each contig, it takes 0(1) time using the hash table 
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to either identify the most hkely clone set to which the contig can be mapped, or to declare the 
contig ambiguous. 

3. Sequence assembly 

This section analyzes CAPSS progress in a hybrid project that uses whole-genome and pooled 
shotgun sequences. CAPS sequences are collected using a transversal design with n pool sets, 
i.e., n/2 arrays. In order to derive a probabilistic model for such experiments, we introduce some 
standard simplifying assumptions and the following notations. Assume that every clone has the 
same length L (100-200 thousand base pairs in practice), and that each shotgun sequence has 
the same length £ (e.g., 500 bp). The WGS and CAPS sequences are combined and compared 
to each other to find overlaps between them. Overlapping sequences form islands. Islands with 
two or more sequences are contigs. An overlap between two shotgun sequences is detected if it is 
at least of length 'di where < < 1. Statistics for islands, and gaps between islands are well 
known ([Lander and Waterman T9881 IWendl a nd Waterston 2002 ). We are interested in statistics 
for clone-linked contigs, those that are assigned to BACs using the pooling information. 

Let a be the coverage by CAPS sequences, i.e., if Fp CAPS sequences are collected, then a = 
where A'' is the total number of clones. Let w denote the coverage by WGS sequences, i.e., if F„ WGS 
sequences are collected, then w = where G is the genome length. Notice that = is possible. 
Here we consider the simplest case of assembling the sequence of a single clone that does not overlap 
with any other clone. Such a clone is covered by a total coverage of {a+w). Although we concentrate 
on sequencing a particular clone, the transversal design allows the simultaneous sequencing of 
multiple, possibly overlapping clones by combining WGS sequences with CAPS sequences from 
many (or even all) pools. Regions of overlapping clones have higher coverage since they are covered 
by more CAPS sequences than a single clone. The sequencing of overlapping regions progresses 
thus faster than what is suggested by the statistics for a single clone. We examine the case of 
assigning contigs to overlapping BACs in ^ Two shotgun sequences from different pools suffice 
to assign a contig to a single BAG. In a practical setting, it may be advantageous to require more 
stringent criteria in order to avoid false mappings. Theorem ^ can be readily adapted for such 
criteria, albeit resulting in bulkier formulas. 

Figures [21 and 01 compare different experimental designs based on Theorem ^ and simulations. 
Figure 121 plots the island statistics from the theorem. It illustrates that for lower coverages (about 
c < 4), the ratio of pooled shotgun sequences makes a large difference in the sequencing. This 
difference is mainly shown in the number of clone- linked contigs, as the contig sizes do not differ 
much. At large coverage levels, when sequencing is nearly completed, the impact of pooled sequences 
is less, i.e., WGS sequences can make up for a lower pooled coverage. 

Figure |3t shows that while more arrays increase the sequencing success, the improvements are 
very small after the second array. Notice that if the clones are selected from a minimally overlapping 
tiling path, then no part of the genome is covered by more than two BACs, and thus two arrays 
suffice for the unambiguous mapping of all contigs that cover clone overlaps. Figure (Hb plots the 
N50 values. The N50 contig length is the value / such that half of the sequenced nucleotides 
belong to contigs of length at least /. The statistics for all designs converge to those of a non- 
pooled sequencing project as the coverage increases. In other words, the negative effects of pooling 
diminish and the project progresses just as without pooling: for example, at total coverage 4-5X, 
99% of the clone is sequenced. 

Theorem 1. Let cr = 1 — where 'd is the fraction of length two shotgun sequences must share 
in order for the overlap to be detected. Consider a BAC that does not overlap with other clones. 
Define c = w + a, the total coverage. Let Xi = — X2 = ^, and Yi = \ — {l — e~'^'^)Xi for i = 1,2. 
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The expected length of a clone-linked contig can be written as ^Aiink where Xunk is bounded 
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Proof. The proof relies on a Poisson process model, following the technique of Waterman (1995) 
We model the location of the shotgun sequences as a Poisson process with rate c. Define fi = a/c, 
the fraction of CAPS sequences. Every sequence is either a WGS sequence with probability (1 — //), 
or comes from each one of the clone's pools with probability fi/n. First we state the well-known 
facts (|Lander and Waterman 19881 IWaterman 1995|l about apparent islands, whether or not they 
are linked to a clone. The event E that a given shotgun sequence is the right-hand end of an 
apparent island has probability J = ¥E = e~'^" . For the A;-th read, define as the number of 
reads from its right-hand end until the first gap towards the left. The probability that an island 
has j sequences in it equals 

p{Mfc=i|F} = (l-J)^-V. 

An island can be mapped to a clone if it contains sequences from at least two pools. The 
probability of mapping the island ending at the A;-th read (event Dk) depends on the number of 
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Figure 2. CAPSS (Theorem^: clone- linked contig statistics. The values are calculated 
from Theorem ^ for two-array transversal designs and different pooled coverage levels a. 
Overlaps between shotgun sequences are detected with i? = 0.1. The number of contigs on 
the right-hand side is given in multiples of L jl. The abscissa is the total coverage c. 




Figure 3. CAPSS (Theorem^: sequencing progress. The left-hand side plots the fraction 
of bases covered by clone-linked contigs as a function of total coverage (c ~ a+w) for different 
designs. Notice that the improvement from two arrays to four arrays (n = 4 vs. n = 8) is 
marginal. The right-hand side plots the N50 values for different designs with two arrays, as 
multiples of I. All values were calculated with shotgun sequence overlap detection -d = 0.1. 
The N50 plot was obtained from simulation: each point is an average of 200 measurements. 



shotgun sequences in the island. Using inclusion-exclusion: 
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By Equation (jSJ, the number of shotgun sequences in a clone-hnked island is distributed by the 
probabihties 
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In Equation Punk = IP|^fe Equation follows from the fact that the expected number 

of shotgun fragments covering the clone equals cL/i. 
By definition of the conditional probability, 
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By Equation ©, 
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The value Acbc in the theorem is the expected island length in a non-pooled sequencing project. 
By Equation ((TJ, and the fact that limc^ooPiink = 1; we have limc^oo Aunk = Acbc when the ratio 
of CAPS sequences is kept constant. This limit result is not surprising given that every island can 
be assigned to a clone with near certainty when the sequence read coverage is large. 

4. Clone overlap detection 

The key observation for this section is that a transversal design makes it possible to map a 
contig unambiguously to more than one BAC at once. Now, a contig that is mapped to two clones 
simultaneously can be viewed as evidence that the two clones overlap. Taking the idea further, an 
entire set of BACs can be tested for overlaps in this manner, which leads us to the Clone-Array 
Pooled Shotgun Mapping (CAPS-MAP) method that is described as follows. A redundant collection 
of random BACs covering a large genome is grouped into subsets of size q'^ . Pooled shotgun sequence 
reads are collected from each clone group using a transversal design with d arrays of size q x q. 
Partitioning into subsets may be dictated by the practical concerns of chemistry, biology and robotic 
automation. For array sizes that are multiples of 8 or 12 or both (yielding standard dimensions of a 
96- well microliter plate), such as g = 24, or q = 48, there exist known (jColbourn and Dinitz 1996|) 
transversal designs. A pooling design with a few (d = 2, 3, 4) arrays suffice to compute the physical 
ordering of BACs in the library, depending on the library's redundancy and the array sizes. In 
addition to the CAPS sequences, WGS sequences are used to increase read contig lengths. The 
shotgun sequences are compared to each other to find the overlaps between them, and are assembled 
into contigs. Contigs that map unambiguously to more than one clone are taken as evidence that 
the clones overlap. See Figures 0] and El for illustrations. The clone overlap information can then 
be used to compute the physical ordering of the BACs in the library, and to select a minimal tiling 




Figure 4. CAPS-MAP detects overlaps between clones by identifying situations where a 
read contig maps simultaneously to two clones. This figure illustrates a transversal pooling 
design with two clone groups and two arrays per group. The transversal design guarantees 
that the intersection of any two pools out of four possible for each BAG (two row and two 
column pools) uniquely identifies the BAG. Note that overlaps between clones on the same 
array can also be detected by a transversal design. 
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Figure 5. Overlaps between clones on the same array can also be detected by a 
transversal design, even in the presence of false negatives, i.e., situations where a 
particular BAG is not represented in a particular pool. Specifically, overlap between 
the two BACs illustrated in the figure is detected despite the fact that each BAG is 
sampled in only three pools. 



path for complete sequencing, just as if the overlaps were detected using a fingerprinting scheme 
(IMarra et al. 19971) . 

Theorem [^considers the case of detecting an overlap between two clones in different clone groups. 
Similar analyses can be carried out for more general cases with more overlapping clones, or clones 
in the same clone group, resulting in more cumbersome formulas. 
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Figure IHl plots the overlap detection probabilities in a few scenarios with different amounts of 
CAPS and WGS sequences. Based on the figure, the probability of detecting an overlap increases 
exponentially toward 1 with the overlap length. The same exponential behavior is characteristic 
of clone anchoring methods for overlap detection (jArratia et al. 1991(1 . Consequently, clone contig 
statistics for CAPS-MAP can be calculated using a clone anchoring model with an appropriate 
anchoring process intensity. Clone contig statistics can also be estimated using a fingerprinting 
model (Lan der a nd Waterman 1988 ) by noticing that clone overlaps above a certain length are 
detected with near certainty. Figure IHl indicates that using IX CAPS coverage and 2-5X WGS 
coverage, BAC overlaps of more than 20000 bp are detected almost certainly. While CAPS-MAP 
uses only the fact that a contig is mapped to multiple BACs, and not the actual contig sequence, 
the sequence information is used in the ensuing sequencing phase, and thus CAPS-MAP represents 
very little overhead in a genome sequencing project. 

It is worth pointing out here that CAPS-MAP detects very short, or even negative clone overlaps 
with non-negligible probability. A short region of the genome that is not covered by BACs in the 
library can be bridged by WGS sequences. The bridging WGS sequences may form a contig with 
CAPS sequences from the two BACs at the gap's ends that can be mapped to the two clones 
simultaneously. This unique feature of CAPS-MAP among clone overlap detection methods does 
not interfere with the calculation of the physical ordering of BACs. At the same time, it does 
decrease the necessary BAC library size for sequencing the genome completely. After the clones are 
selected for complete sequencing, the already collected WGS sequences are included in the genome 
sequence assembly. Consequently, negative overlaps detected by CAPS-MAP are already covered 
by shotgun sequences in the sequencing phase, and pose no additional requirements for shotgun 
sequence collection. 

Theorem 2. Let two clones from different clone groups share an overlap. Define C2 = 2a + w, the 
total shotgun sequence coverage for the overlap. Define 

w + {l + ^)a w + a w+'^ w + ^ w_ 

Pi = P2 = P3 = P4 = P5 = —, 

C2 C2 C2 C2 C2 

7, = 1 - (1 - e-'=2'^)A fori = l,...,5. 

(i) An apparent island in the overlap consisting of j > shotgun sequences is mapped to the 
two clones simultaneously with probability 1 — q{j) where 

(14) q{j) = 2np( - 2(n - l)/3^ - n^/?^ + 2n{n - l)(3{ - (n - if (31 < 2n(3{. 

(ii) An apparent island covering the overlap is mapped to the two clones simultaneously with 
probability 

/lt-\ 1 -C20-A Pi r,l T\P2 2/^3 , r, / Pa / T \2 P5 

(15) p2 = I — e ^ I 2n 2[n — 1) n h 2n[n — 1) (n — 1) — 

V 71 72 73 74 75 

Proof. The overlap is detected if it is covered by an island that can be simultaneously mapped to 
the two clones. We model the location of the shotgun sequences as a Poisson process with rate C2. 
Define ^2 = -, the fraction of CAPS sequences covering the overlap. Every shotgun sequence is 
either a WGS sequence with probability (1 — ^12), or comes from each one of the two clones' pools 
with probability /U2/(2n). The event E2 that a given shotgun sequence is the right-hand end of an 
apparent island has probability J2 = P-E2 = e~^^^ . For the k-th sequence, define as the number 
of sequences from its right-hand end until the first gap towards the left. The probability that an 
island has j sequences in it equals 



[Mk=j I E2] = (l-'/2)' 'j2. 
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The probability of mapping the island that ends at the k-th. shotgun sequence (event Dk) depends 
on the number of sequences in the island. We calculate the probability of event -D^ in separate 
cases. Let pofl{j) denote the event that the island consists of WGS sequence reads only given that 
it has j reads. Then 



(16a) 



PoflU) = (1 -/^2F- 



Let Po,*ij) denote the event that the island consists of CAPS sequences for one clone only and 
WGS sequences, given that it has j shotgun sequences in it: 



(16b) 



2 / ■ 



Let pifi{j) denote the event that the island consists of CAPS sequences from a fixed pool and WGS 
sequences, given that it has j shotgun sequences in it: 
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Let Pi,i(i) denote the event that the island consists of CAPS sequences from a fixed pool for one 
clone, from another fixed pool for the other clone, and WGS sequences, given that it has j shotgun 
sequences in it: 
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Let Pi,+(i) denote the event that the island consists of CAPS sequences from a fixed pool for one 
clone, at least one CAPS sequence for the other clone, and WGS sequences: 

1 



(16e) Pi,+U) 
Using inclusion-exclusion. 
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E2,Mk=j 



} = (2^'o,*(i) -PofiU)) +2npi,+(j) -nV,i(i)- 



By Equations (|16aH16e|l . 



(17) P| 



2n 1 



n 



1 



2n 
n — 1 

n 



2(n-l)(l-fy 



/U2 +2n(n-l) 1 



n- 2 



n 



^2 



(n- 1)2(1 -;,2)^ 



which corresponds to Equation ()14|) with q[i) 
as before 



'{ 



<}■ 



E2,Mk = jf- Using the same technique 



l-p2 = F[Dk I ^2} = ^IP{l)fc 



E2,Mk 



E, 



leading to Equation (fT3)) . 

Recall that q{j) is the probability of failing to map a contig of j reads to the two clones simul- 
taneously. In order to show that the inequality in Equation (|14() holds, we prove that 



(18) 



q{j) < 2nf3^ - (2n - 1)/?^ < 2n/3^'. 



Notice that (3^ < (54, < (5^ < (32 < Pi and thus q{j) / 2ni5{. Since (i^ = (/^s + f3^)/2, it follows from 
the convexity of that 



(19) 



2Pi<Pi + Pl 
11 



(Alternatively, notice that the same inequality follows from > in Equation HlGdl) .) We 

proceed by rearranging the equality of Equation H14() : 



2npi - {2n - - q{j) = 2{n - l)[5{ + (n - 1)^/3^ - 2n(n - + (n - 



> by Eq. > since (32 > Pa 

which proves Equation (|18|) . □ 

It is difficult to derive useful closed formulas for the probability of overlap detection. For example, 
based on Equation (|15jl . the number of contigs in the overlap that are simultaneously mapped to 
the clones can be modeled as arrivals in a Poisson process with intensity C2e~'^^"p2- For practical 
values of C2, this model seriously underestimates the probability of overlap detection. The problem 
is similar to the one of using Lander- Waterman statistics ([Land er and Waterman 1988) at high 



coverage levels (see Wendl and Waterston (2002) for a discussion). For a more suitable model 



let G be the number of gaps entirely contained in the overlap, and number the islands from to G. 
Let jo,j2, ■ ■ ■ iJg denote the number of shotgun sequences in the islands. The probability that none 
of the islands can be mapped simultaneously to the two clones can be calculated as 

G 

(20) Pnomap(jo, • • • ,Jg) = H^^-^*)' 

i=0 

where q{j) is defined by Equation p4j) . (Notice that G and the ji are random variables.) We are 
interested in the expected value Pnomap = IEpnomap(joi ■ ■ ■ iJg)- In order to get a good assessment 
of CAPS-MAP performance, we found that it is best to use a Monte-Carlo estimation^ of this 
expected value; see Figure IHl For an alternative, observe that the inequality of Equation H14() 

implies Pnomap < IE /?/^(2n)'^+^ where R is the number of sequences in the overlap, and thus 

R = Ylf^Qji- Based on this observation, we derived bounds (see Appendix) that are useful for 
large values of C2 (e.g., C2 = 7), but at lower coverages, this approach also underestimates the 
overlap detection probabilities significantly. 

5. BAG ORDERING 

Our analyses so far have focused on detecting BAC overlaps via CAPS-MAP. This localized 
perspective was partly adopted to ease the theoretical analysis. In practice, mapping is performed 
based on a global clone-contig incidence matrix. The global approach exploits the dependencies 
in the collected data for increased accuracy. The algorithmic issues are very similar to those 
encountered in the context of STS-based physical mapping (Gusfield 1997). Define the mapping 
matrix M in which the rows correspond to the BACs, the columns correspond to the contigs, and 
M[i, j] = 1 if contig j is linked to clone i, otherwise M[i, j] = 0. We want to find the true ordering 
of the rows and columns, defined by their physical locations on the genome. Assume for a moment 
that the matrix is completely error-free, i.e., all contigs are correctly assembled, and all contig-clone 
overlaps are detected. It is not hard to see that the row and column permutations corresponding 
to the correct ordering result in a matrix M' that satisfies the consecutive ones property (CIP) 



"'^Specifically, for every overlap size considered, we carried out a number of simulated experiments. Each experiment 
used a fixed number of shotgun sequences R placed randomly in the overlap, and produced an instance of a (jo, . . . , jc) 
vector, for which Pnomap(jo, • • • , Jg) was calculated using Equation 1141 . The average of these values was used to 
estimate pnomap- The average was weighted with the probabilities of different R values, given by a Poisson distribution. 
The set of R values was chosen so that it provided a sufficient accuracy for the weighted average estimate. For every R, 
ten thousand experiments were performed. 
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Figure 6. Clone overlap detection. The graph shows the probability of not detecting an 
overlap between two clones, as a function of the overlap size. The plots were calculated by 
a Monte-Carlo method using Theorem|21 All plots use — 0.1 for shotgun sequence overlap 
detection, and £ = 500 for shotgun sequence length. 

in the rows and the columns: for every row i, there exist a < b with M'[-i,j] = 1 if and only if 
^ j ^b, and the same property holds for the columns. (A sufficient condition ensuring row-wise 
(or column- wise) CIP is that if the left endpoint of a contig (or a clone) precedes the left endpoint 
of another one, the same holds for their right endpoints.) Finding such permutations is a well- 
known problem ({Booth and Lueker 1976j) . and can be done in linear time. When the matrix is not 
error-free, one can use techniques introduced for STS-based physical mapping. In ^Hlwe detail a 
method that relies on traveling salesperson tours. 

6. CAPS-MAP SIMULATION OF DrOSOPHILA ASSEMBLY 

We tested the CAPS-MAP approach by simulating the assembly of the Drosophila melanogaster 
genome. One of the main goals of the simulated assembly was to predict the performance a 
hybrid approach combining WGS and CAPS sequences in a project that closely resembles the 
setup of the honey bee genome project's (http://www.hgsc.bcm.tmc.edu/projects/honeybee/), 
currently pursued at the Human Genome Sequencing Center (HGSC) of Baylor College of Medicine. 

Concatenating all the Drosophila genome sequence (Release 2.5, 112.6 million bases), 2880 BAC 
sequences were generated by randomly picking their locations and lengths. The mean BAC insert 
length was 150 kbp, and its standard deviation was 500 bp. The resulting random BAC library pro- 
vides 3.6X coverage of the genome. BACs were arrayed by first partitioning them into 5 groups, and 
then using a two- array 24 x 24 transversal design for each group. Every BAC was covered by 1.2X 
CAPS sequences: 0.4X per pool on the first array and 0.2X per pool on the second (reshuffled) array. 
In addition, WGS sequences were produced at 4X genome coverage. The shotgun sequences were 
generated using the program wgs-simulator (written by K. James Durbin), which mimics shot- 
gun sequence collection realistically by relying on sequence quality files ( |Ewing and Green 1998 1 
produced in sequencing projects. 

Shotgun sequences were assembled into contigs using the Atlas suite of genome assembly tools 
(http : //www.hgsc .bcm. tmc . edu/downloads/soft ware/at las/) and Phrap (http : //www.phrap . org/). 
A contig was mapped to a clone if it contained sequences from all four clone pools. Contigs that 
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Minimum bactig size 


Genome covered 


Number of BACs in bactigs 


2 


97.1% 


2758 


3 


96.7% 


2746 


5 


94.9% 


2714 


10 


88.5% 


2565 


15 


82.4% 


2400 


20 


77.9% 


2284 


30 


65.0% 


1945 


51 


50.9% 


1521 


60 


40.3% 


1195 



Table 1. Statistics for simulated Drosophila assembly. This table details the 
genome and BAG library coverage by bactig sizes. More than half of the genome is 
covered by bactigs with at least 51 BACs in them, defining the N50 statistic for the 
clone map. 



mapped to more than one BACs provided the evidence of BAC overlaps. BACs were grouped into 
maximal overlapping sets, or bactigs. 

We compared the overlap graphs to assess CAPS-MAP overlap detection. The vertices of the 
overlap graphs are the BACs, and two BACs are connected if there is an overlap between them. 
The true overlap graph for the original BACs contains 2880 vertices, and 10992 edges in 66 graph 
components. The overlap graph calculated from the bactigs has 9193 edges in 110 components. 
Among its edges, 8527 (93%) are correct, and 2465 (22%) of the true overlaps are not discovered. 
The median length of detected overlaps is 87 kbp, and the median length of undetected overlaps is 
42 kbp. There are 666 edges that correspond to no real overlaps. The vast majority of these "false 
positives" are instances when a long read contig links several BACs, which do not always overlap 
pairwise. All but two of the CAPS-MAP bactigs are true overlapping sets of BACs. CAPS-MAP 
links the assembled contigs to BACs correctly even in these two bactigs: the source of the error is 
the read contig assembly. Tabled shows statistics on the bactig sizes and genome coverage. 

BACs were ordered within each bactig. For every bactig, an overlap matrix M was calculated, in 
which the rows correspond to the bactig's clones, the columns correspond to the contigs linked to 
at least one bactig clone, and M[z, j] = 1 if contig j is linked to clone i, otherwise M[i, j] = 0. The 
following traveling salesperson (TSP) formulation is used to find the correct column permutation. 
We search for a tour in a graph, in which every vertex corresponds to a contig (and thus a column), 
with an additional vertex uq. The weight of an edge between vertices u and u' , corresponding to 

contigs j and j' , is the number of rows in which they differ: w{u,u') = ^ M[i,j']|, 

where is the indicator function. The weight of an edge between u and uq is the sum of ones 
in the column j that corresponds to u: w{u,uo) = X]jM[z,j]. Now, a Hamilton path with the 
minimum weight in this graph gives the best column permutation in the sense that it minimizes the 
number of gaps between blocks of ones within rows (Alizadeh, Karp, Newberg, and Weisser 1995). 



The best row ordering could be found in an analogous manner, but we used a simpler method 
which worked better in practice. Clones are ordered relatively to the contig order by placing 
clone B before B' if the first contig B is linked to is before the first contig B' is linked to, or if 
their first contigs are identical but B has its last contig before B'. 

We used the concorde program (Applegate et al. 19991 to solve the TSP instances. The resulting 



row permutation is then further analyzed to find clones, for which the permutation arbitrarily 
enforces an order. Specifically, if consecutive rows of the permuted matrix M' are identical, then 
the order of the corresponding clones is not resolved. Subsequently, we compared the TSP orders to 
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Bactig 23 (72 BAOs) 




, Bactig 16 (50 BACs) 

ISOkbp 




Figure 7. Correctness of BAG ordering in Drosophila simulation. The top (Loc) 
of each graph shows the relative physical location of each BAG, the middle (True) 
shows the correct BAG order, and the bottom (TSP) shows the TSP order, and the 
BAG identifiers. Identical BAGs are connected in order to display the differences 
between the two permutations. The order of BAGs at the bottom is not resolved 
when they are connected with a horizontal line. By resolving them optimally, bactig 
23 produces the order of 72 BAGs with 12 breakpoints and bactig 16 orders 50 
BAGs with 9 breakpoints. (Breakpoints are neighbors in the TSP order that are 
not neighbors in the true order.) 

the true orders, which is known since the BAG sequences are generated artificially. Figure [3 shows 
the outcome of the comparison for two bactigs. The TSP order is very close to the true order. 

7. Discussion 

The experimental expedience of shotgun sequencing has been essential for the success of genome- 
scale sequencing projects in the past decade. The power of the concept comes from the now 
established fact that the loss of information about read localization incurred by random subcloning 
can be largely recovered in the assembly step using sequence information. Glone pooling is similar 
in spirit to shotgun sequencing in that it introduces experimental expedience by dramatically 
reducing the number of subclone library preparations. The clone pooling step leads to a temporary 
loss of information about localization of shotgun sequences on individual BAG clones. We have 
demonstrated that sequence information can be used to successfully recover most of the information 
lost in pooling. 

Our analyses presented here indicate the theoretical feasibility of the GAPS-MAP method and 
provide guidance for the design of genome-scale GAPS-MAP experiments. In particular, our anal- 
ysis indicates that transversal pooling designs can accommodate high levels of clone redundancy 
and perform well even at low levels of shotgun sequence coverage of clone pools. 

Practical biological and technical considerations may set a limit to the array size. In case of large 
genomes, the limitations may imply that the set of BAGs is partitioned and that pooling is applied 
separately to individual subsets. This results in a lower clone redundancy within individual arrays 
and a larger number of pools. Our analysis allows for the partitioning of clones. It also allows 
for the possibility of including whole-genome shotgun sequence reads. It thus covers realistic and 
practical scenarios of the GAPSS and GAPS-MAP methods' application. 
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Appendix 

Here we expand our discussion on the probability of overlap detection in CAPS-MAP. In par- 
ticular, we derive formulas that show the exponential decay of the probability of not detecting an 
overlap when the coverage C2 is not too small. We start with the bound 



(21) 
Define 



Pnomap < E f3i {2n) 



9r{z) = E 



R 



the probability generating function for the distribution of the number of gaps conditioned on the 
number of shotgun sequences. Define the events Ai for i = l,...,r— 1: Ai denotes the event that 
the i-th sequence is followed by a gap, conditioned on the event {R = r}. For arbitrary g, and set 
of indexes ii < i2 < ■ ■ ■ < ig, 

¥{Ai,A,,...A,^] = {l-g6)l, 
where 5 = and = max{0, x} (jEwens and Grant 200HrWendl and Waterston 2002|) . Let 

So = l 

'r - V 



Using inclusion-exclusion, 



ll<-<lg 



G = g 



j=g 



Hence, 



r— 1 r— 1 



9=0 3=9 

r-1 j 

j=0 9=0 
3=0 



Substituting the Sj values: 
(22) 



9r{z) 



r-1 

E 

3=0 



r — 1 
j 



a result interesting on its own. 

Returning to Equation l\'21^ . we have 



(23) 



Pnomap ^ E 



R-1 



2n/?f y: 



3=0 



R-1 



{l-jS)^(2n-iy 



where R is a Poisson random variable with mean 



C2GL 
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region of exponential bound for overiap detection 




Figure 8. Values of the pooled shotgun coverage a and WGS coverage w, for which 
the clone overlap detection bound applies, are above the graphs (see Equation (^5]) ) 



For every r > 0, (1 — j5)'^ < e hence 



r-l 

E 

j=0 



r — 1 
3 



(1 - jdf, (2n - ly < 1 + (2n - l)e-''' 



r-l 



Consequently, by Equation 

Pnomap ^ ^ 



2n/?f 1 + (2n - l)e 



-R& 



R-l 



Recall that the random value we take the expectation of is an upper bound on Pnomap(io; • • • iJc) 
and thus if it is larger than one, it is useless. Let 

/(r) = min|l,2n/?[ (^1 + (2n - 1)6-'''^^'^}. 
So we have in fact the bound 

(24) Pnomap < E/(i?). 

In order to achieve exponential decay in the bound, we would like to have 

/3i(l + (2n-l)e-'^«^) < 1 
for some rg < A. Rearranging the inequality, we have 

(25) 



(2n - i) ^^° + ^{ + " < e(2-+-)- 



(n — l)a 

which is satisfied when a and w are not too small (see Figure |HI). 

There are several possible ways to exploit the fact that the exponential component of /(r) 
becomes for r less than the expected value A. The main idea is that when evaluating E,f{R) = 

f{r)¥{R = r} in Equation H24j) . either the probability of = r is small, or the value of /(r) is 
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small. Let < A; < A be a threshold (that we specify later), and let a = k/\. To proceed with 
Equation (|24() . we condition on the event {R < a\}. We use the bound 

-A(l-a)2/2 

(26) ¥{R < aX} < 



(1 — a)V2'iraX ' 
which we prove here quickly. By definition, 

r=0 r=0 



kl (1 - a)V2TTaX' 



where we used a Stirling approximation: k\ > {k/ e)^ / ^/2-Kk. Using a Taylor series expansion, 

1 - Q + alna = ^(1 - a)2 + 1(1 - af + ^(1 - a)^ . . 

and thus 1 — a + a In a > ^(1 — a)'^ for < a < 1, and Equation H26|) follows. 
Now, 



E/(i?) =E f{R) 



R<aX 



F{R<aX} + E f{R) 



R> aX 



> aX} 



< F{R < aA} + E f{R) R > aX 



(^/3i(l+(2rt-l)e-"'5^)^ 



(l-a)^/2^ ' 1 + (2n - l)e-"'5A 

exp(-A(l - a)V2) 2nexp(-A(l - /3i(l + (2n - l)e— ^-))) 



(1 - Q)V27raA 



+ 



1 + (2n - l)e-"'=2'T 



where we used (5A = C2cr. Figure shows values of a for different a, li; pairs that balance the 
exponents in the two terms. 

After choosing a balancing a value for a given (a, w) pair, we obtain 



Ef{R) < Xi exp(-X2CjL), 



where Xi and X2 are constants that do not depend on a. The bound becomes small (< 10~^) 
for larger C2 values (e.g., C2 = 7), but even then, it is not very tight. Based on simulation results, 
the tightness is lost with the inequality of Equation ()21|) . and not in the following steps. For 
example, we evaluated the bounds of Equations H23|) and (|24() numerically. While they are fairly 
close to each other, and to the exponential bound using a, they already bound the expected value 
of Equation (|2U() rather loosely in many cases. Furthermore, even for (a, w) pairs for which we 
cannot establish exponential decay using the inequality of Equation 1)21(1 . the overlap detection 
probability may get very close to one. For instance, a two-array design with a = 0.5 and w = 2 falls 
below the curve of Figure |H1 yet can be employed efficiently in CAPS-MAP as shown in Figure El 
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Figure 9. Balanced a values for our exponential bound. 

Therefore, we prefer using a Monte-Carlo evaluation of Equation (|2())1 to predict the experimental 
performance of CAPS-MAP. 
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